"Team Stuttgart and Tübingen – GeneiusVis"

 

VAST 2010 Challenge

Genetic Sequences - Tracing the Mutations of a Disease

 

 

Authors and Affiliations:

Julian Heinrich, University of Stuttgart, julian.heinrich@visus.uni-stuttgart.de

Andre Burkovski, University of Stuttgart, andre.burkovski@visus.uni-stuttgart.de

Florian Battke, University of Tübingen, florian.battke@uni-tuebingen.de

Alexander Herbig, University of Tübingen, alexander.herbig@uni-tuebingen.de

Stephan Symons, University of Tübingen, symons@informatik.uni-tuebingen.de

Kay Nieselt, University of Tübingen, kay.nieselt@uni-tuebingen.de

 

Tool(s):

To solve this mini challenge, we developed a visual analytics tool that integrates a phylogenetic tree computed with the neighbor-joining method. The phylogenetic trees were computed with ClustalX. ClustalX is a multiple sequence alignment program; its alignment procedures were not needed, however, because the challenge sequences were already aligned. The phylogenetic tree was exported as a Newick file, a standard file format in phylogenetics. The developed tool, ‘GeneiusVis’, consists of two linked views: a Tree Visualizer, which offers different layouters for phylogenetic trees as well as interactive node and edge selection, and an alignment viewer for multiple sequence alignments, which makes it possible to trace the mutations of a disease. Selections of rows in the alignment viewer are linked to the respective nodes in the Tree Visualizer and vice versa. The alignment viewer also computes consensus sequences interactively from selected rows. A consensus sequence contains the most frequent nucleotide at each position of the selected sequences.

Additionally, R was used to compute the mutual information of pairs of columns in a multiple sequence alignment, as well as to determine the mean evolutionary divergence between two groups of sequences. Finally, the WEKA library was used to validate the findings.

 

Video:

 

Video

 

 

ANSWERS:

 


 

MC3.1: What is the region or country of origin for the current outbreak? 

 

Answer: Nigeria_B

 

To determine the origin of the virus associated with the current outbreak of the Drafa virus, we conducted genetic analyses of all native sequences and those of the current disease outbreak. Phylogenetic analyses based on the nucleotide sequences showed that all viral sequences from the disease outbreak are very closely related and form a monophyletic cluster. This indicates that all strains from the current outbreak descend from one common ancestor, the strain from Nigeria B (highlighted in red in Figure 1). The same answer is found when using amino acid sequences. In addition, we used R to compute the average nucleotide divergence between the current Drafa viruses and each native strain sequence. The minimum is 0.010799, which again is the divergence to Nigeria B.
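The divergence comparison described above can be sketched as follows. This is only an illustration in Python, not the R code used for the submission: it assumes the divergence is the plain p-distance (the fraction of aligned sites that differ), which the text does not specify, and the function names and toy sequences are invented for this example.

```python
from itertools import product


def p_distance(a, b):
    """Fraction of aligned positions at which two sequences differ.

    Assumption: the p-distance stands in for the divergence measure,
    which the text does not name.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def mean_divergence(group_a, group_b):
    """Average p-distance over all pairs drawn from the two groups."""
    pairs = list(product(group_a, group_b))
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)


def closest_native(outbreak, natives):
    """Return (name, divergence) of the native strain closest to the outbreak."""
    scores = {name: mean_divergence(outbreak, [seq])
              for name, seq in natives.items()}
    best = min(scores, key=scores.get)
    return best, scores[best]


# Toy data, invented for illustration only.
outbreak = ["ACGTACGT", "ACGTACGA"]
natives = {"native_1": "ACGTACGG", "native_2": "TTGTACCC"}
print(closest_native(outbreak, natives))
```

The same `p_distance` helper could be applied to a single pair of outbreak sequences when only two strains are compared.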

 


Figure 1: A phylogenetic tree of all sequences in the Tree Visualizer. The native strain sharing the lowest common ancestor with all strains of the current outbreak is highlighted in red.

 


 

MC3.2: Over time, the virus spreads and the diversity of the virus increases as it mutates. Two patients infected with the Drafa virus are in the same hospital as Nicolai. Nicolai has a strain identified by sequence 583. One patient has a strain identified by sequence 123 and the other has a strain identified by sequence 51. Assume only a single viral strain is in each patient. Which patient likely contracted the illness from Nicolai and why?

 

Answer: ID 123

 

After performing a phylogenetic analysis of the nucleotide sequences, restricted to the sequences of the current disease outbreak, we imported the resulting tree into the Tree Visualizer. The Alignment Viewer can be used to sort and select IDs more efficiently. The corresponding nodes are simultaneously selected and highlighted in the tree. We selected the three IDs 583, 123 and 51 and interactively colored them red. From the tree, it is evident that 123 is much closer to 583 than 51 is. This is supported by the evolutionary divergence between 583 and 123, which is 0.000713, while the evolutionary divergence between 583 and 51 is 0.002141. The same answer is found when using amino acid sequences.

 


Figure 2: Phylogenetic tree and alignment of all current outbreak sequences. Selection of strains in the alignment viewer automatically highlights the respective nodes in the tree.

 


 

MC3.3: Signs and symptoms of the Drafa virus are varied and humans react differently to infection. Some mutant strains from the current outbreak have been reported as being worse than others for the patients that come in contact with them.

Identify the top 3 mutations that lead to an increase in symptom severity (a disease characteristic). The mutations involve one or more base substitutions. For this question, the biological properties of the underlying amino acid sequence patterns are not significant in determining disease characteristics.

For each mutation provide the base substitutions and their position in the sequence (left to right) where the base substitutions occurred. For example,

C -> G, 456 (C changed to G at position 456)

G -> A, 513 and T -> A, 907 (G changed to A at position 513 and T changed to A at position 907)

A -> G, 39 (A changed to G at position 39)

 

Answer:

A -> G, 223

A -> C, 269

T -> C, 109

 

For all following tasks, nucleotides and disease characteristics (with increasing severity from 0 to 2) have been mapped to colors and opacity.


Figure 3: Alignment Viewer with disease characteristics.

 

We sorted all sequences by symptom severity and computed the consensus sequence for every severity group. The color of the consensus nucleotide corresponds to the most frequent one of the group, while opacity reflects its relative frequency and thereby the degree of conservation in the respective group. The most prominent correlations between alignment columns and the opacity of symptom severity occur in columns 22, 79, 109, 161, 223, 269, 842 and 946. As positions 22, 79, 161, 842 and 946 turn out to be correlated with other disease characteristics (see below), only 109, 223 and 269 remain.

 


Figure 4: Three consensus sequences grouped by strains with equal symptom severity.

 


 

MC3.4:  Due to the rapid spread of the virus and limited resources, medical personnel would like to focus on treatments and quarantine procedures for the worst of the mutant strains from the current outbreak, not just symptoms as in the previous question. To find the most dangerous viral mutants, experts are monitoring multiple disease characteristics.

Consider each virulence and drug resistance characteristic as equally important. Identify the top 3 mutations that lead to the most dangerous viral strains. The mutations involve one or more base substitutions.  In a worst case scenario, a very dangerous strain could cause severe symptoms, have high mortality, cause major complications, exhibit resistance to anti viral drugs, and target high risk groups. For this question, the biological properties of the underlying amino acid sequence patterns are not significant in determining disease characteristics.

For each mutation provide the base substitutions and their position in the sequence (left to right) where the base substitutions occurred. For example,

C -> G, 456 (C changed to G at position 456)

G -> A, 513 and T -> A, 907 (G changed to A at position 513 and T changed to A at position 907)

A -> G, 39 (A changed to G at position 39).

 

Answer:

T -> C, 842 and A -> T, 946 

G -> C, 161 and T -> C, 790

G -> C, 22 and C -> A, 79

 

Our general approach is to associate the genotype of a strain with its disease characteristics, the phenotype, in a matrix-based alignment view. Symptom characteristics were mapped to integers and added as meta information to the nucleotide alignment. An additional column summing the symptoms for each patient was added as a scoring function for overall virulence. Columns can be moved, hidden and sorted. In the alignment viewer, each position is colored either by nucleotide or by attribute value. For attributes, a single hue (red) was used, with opacity denoting the attribute value. If rows are selected, a consensus sequence can be computed, which is then shown instead of the rows it represents. The nucleotide of the consensus sequence at position i is the one with the largest frequency: argmax over c in {A, G, C, T} of f(c, i), where f(c, i) is the frequency of nucleotide c at position i among the selected rows. Here, opacity is mapped to the relative frequency of the nucleotide in the consensus, reflecting its degree of conservation. Several consensus sequences can also be joined into a new consensus sequence, allowing the user to interactively build a ‘consensus tree’. Finally, for attribute values the average is taken as the consensus instead of the most frequent value. This is reasonable because the disease characteristics were previously mapped to a linear scale.
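The consensus rule described above can be sketched in a few lines. This is a minimal Python illustration rather than the Qt implementation of the alignment viewer; the function names and the toy rows are invented, and ties between equally frequent nucleotides are resolved by first occurrence in the column, which is an assumption the text does not cover.

```python
from collections import Counter


def consensus(rows):
    """Return one (nucleotide, relative_frequency) pair per alignment column.

    The relative frequency is the value that would be mapped to opacity.
    """
    if not rows:
        raise ValueError("need at least one selected row")
    result = []
    for column in zip(*rows):
        counts = Counter(column)
        # most_common(1) keeps the first-seen nucleotide when counts tie
        nucleotide, count = counts.most_common(1)[0]
        result.append((nucleotide, count / len(column)))
    return result


def attribute_consensus(values):
    """Attributes use the average instead of the most frequent value."""
    return sum(values) / len(values)


# Toy selection of three aligned rows, invented for illustration only.
rows = ["ACGT", "ACGA", "ATGA"]
for position, (nucleotide, freq) in enumerate(consensus(rows), start=1):
    print(position, nucleotide, round(freq, 2))
```

Returning the relative frequency alongside the nucleotide keeps the color and the opacity of a consensus cell in one place.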

We first noted that many positions in the alignment are perfectly conserved, i.e. all strains have an identical nucleotide at that position. Since conserved positions cannot contribute to the virulence of a mutant, we (automatically) removed these positions from further consideration. We also removed all columns in which just one strain differed from the other strains. These singular mutations define the identity of an individual strain, but not mutant virulence.
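The column filter described above can be sketched as follows. This is an illustrative Python version with invented names and toy data. It interprets "just one strain differed" as exactly one row not carrying the most frequent nucleotide, which is an assumption about the original filter.

```python
from collections import Counter


def informative_columns(rows):
    """Return 1-based positions that are neither conserved nor singleton columns."""
    keep = []
    for position, column in enumerate(zip(*rows), start=1):
        counts = Counter(column)
        if len(counts) == 1:
            continue  # perfectly conserved column
        minority = len(column) - max(counts.values())
        if minority == 1:
            continue  # only a single strain deviates from the majority
        keep.append(position)
    return keep


# Toy alignment, invented for illustration only.
rows = ["ACGTA",
        "ACGTC",
        "ACTAC",
        "ATTAC"]
print(informative_columns(rows))
```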

We then compared the remaining 14 columns with the virulence score (the sum over all attribute values). The scores range from 1 to 8. First, we sorted all rows by virulence score. Next, we asked whether all 8 phenotype levels are represented by different consensus sequences. The main idea is to find visual correlations between the opacity of an alignment column and the opacity of the phenotype attributes. Again, we searched for the visually most prominent correlations between alignment and attribute columns. However, for the remaining 14 columns, such a fine resolution into 8 phenotypes is not reflected by the consensus sequences (Figure 5).
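As a rough sketch of the scoring step, the virulence score can be computed as the sum of the integer-coded attributes and the strains grouped by score level. The attribute names, strain IDs and values below are hypothetical and serve only to make the example runnable.

```python
from collections import defaultdict


def virulence_score(attributes):
    """Sum of all integer-coded disease characteristics of one strain."""
    return sum(attributes.values())


def group_by_score(strains):
    """Map each score level to the list of strain IDs having that score.

    The returned dict is ordered by ascending score.
    """
    groups = defaultdict(list)
    for strain_id, attributes in strains.items():
        groups[virulence_score(attributes)].append(strain_id)
    return dict(sorted(groups.items()))


# Hypothetical strains and attribute names, invented for illustration only.
strains = {
    "s1": {"symptoms": 2, "mortality": 1, "complications": 1,
           "drug_resistance": 1, "at_risk": 1},
    "s2": {"symptoms": 0, "mortality": 0, "complications": 0,
           "drug_resistance": 1, "at_risk": 0},
    "s3": {"symptoms": 1, "mortality": 1, "complications": 2,
           "drug_resistance": 1, "at_risk": 1},
}
print(group_by_score(strains))
```

Each group of IDs can then be passed to a consensus computation to obtain one consensus sequence per score level.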

 


Figure 5: 8 consensus sequences of all sequences with virulence score levels 1, 2, ..., 8, respectively.


Figure 6: 4 consensus sequences corresponding to 4 levels of overall virulence. For clarity, only the colors and the corresponding opacity are shown. The tooltip of a cell shows the underlying distribution of nucleotides.

 

Since our tool allows efficient visual comparison of consensus sequences, which can be computed quickly from selected rows, we gradually reduced the resolution of the overall phenotype groups from 8 to 4. The 4 phenotypes correspond to very low (scores 1 and 2), low (scores 3 and 4), medium (scores 5 and 6) and high (scores 7 and 8) virulence (Figure 6). We see four overall patterns of alignment columns:

 

1: decreasing opacity, indicating increasing mutation rate with increased virulence

2: increasing opacity, indicating decreasing mutation rate with increased virulence

3: one color but not with steadily increasing or decreasing opacity, and

4: columns with more than one color, indicating that the majority of strains in that group have a different nucleotide at that position than all other strains.

 

Positions 842, 161 and 790 belong to pattern 1, positions 22, 79 and 946 are from pattern number 4.
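The grouping into four levels and the four column patterns listed above were identified visually. The following Python sketch shows one way the same distinction could be expressed programmatically. The function names, the strict monotonicity test and the toy values are assumptions made for this illustration. They are not part of the tool.

```python
def severity_level(score):
    """Map a virulence score (1..8) to the four levels used in the text."""
    if not 1 <= score <= 8:
        raise ValueError("score outside the 1..8 range reported in the text")
    return ("very low", "low", "medium", "high")[(score - 1) // 2]


def classify_column(consensus_by_level):
    """Assign one alignment column to pattern 1, 2, 3 or 4.

    consensus_by_level holds one (nucleotide, relative_frequency) pair per
    virulence level, ordered from the lowest to the highest level.
    """
    if len(consensus_by_level) < 2:
        raise ValueError("need at least two levels to compare")
    nucleotides = {n for n, _ in consensus_by_level}
    if len(nucleotides) > 1:
        return 4  # the majority nucleotide changes between levels
    freqs = [f for _, f in consensus_by_level]
    steps = list(zip(freqs, freqs[1:]))
    if all(a > b for a, b in steps):
        return 1  # opacity decreases with increasing virulence
    if all(a < b for a, b in steps):
        return 2  # opacity increases with increasing virulence
    return 3  # one color, but no steady trend


# Toy column, invented for illustration only.
print(severity_level(7),
      classify_column([("T", 0.9), ("T", 0.8), ("T", 0.6), ("T", 0.55)]))
```

A strict comparison is used here so that a column with constant opacity falls into pattern 3 rather than into pattern 1 or 2.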

However, the decision involves some uncertainty, since other positions could also be chosen as candidates. One observation is that individual positions contribute differently to the individual disease characteristics. We successively repeated our analysis with each of the 6 attribute columns and found that positions 161 and 790 mainly lead to worse ‘complications’, while positions 22 and 79 (and possibly 1033) as well as 842 and 946 correlate well with ‘drug resistance’ (Figure 7). As ‘complications’ seem to become more severe with increasing ‘drug resistance’, these positions appear to drive the increase in overall virulence.

Figure 7: The reduced alignment with attributes but without labels, sorted by ‘complications’ and ‘drug resistance’.

 

Using color without labels, as in Figure 7, helps the researcher identify patterns among columns. For example, sorting rows by ‘complications’, we immediately see that this attribute is correlated with position 790: all strains with major complications have a ‘C’ (blue), while the other strains, except for two strains with minor complications, have a ‘T’ (purple) at that position. We also clearly see the correlated mutation patterns of positions 22, 79 and possibly 1033, as well as of positions 842 and 946.

 

Using the statistics package R, we also determined which positions are correlated with each other. To this end, we computed the pairwise mutual information of the alignment columns and visualized the results in a heatmap (Figure 8). The heatmap allows quick identification of cells with large mutual information values, which correspond to pairs of highly correlated columns.
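The pairwise mutual information of alignment columns can be sketched as follows. This Python version is only an illustration of the standard plug-in estimate in bits and is not the R code used for the analysis; the function names and the toy alignment are invented.

```python
from collections import Counter
from math import log2


def mutual_information(col_x, col_y):
    """Plug-in estimate of the mutual information (in bits) of two columns."""
    n = len(col_x)
    if n == 0 or n != len(col_y):
        raise ValueError("columns must be non-empty and of equal length")
    count_x = Counter(col_x)
    count_y = Counter(col_y)
    count_xy = Counter(zip(col_x, col_y))
    # sum over observed pairs of p(x,y) * log2(p(x,y) / (p(x) * p(y)))
    return sum((c / n) * log2(c * n / (count_x[x] * count_y[y]))
               for (x, y), c in count_xy.items())


def mi_matrix(rows, positions):
    """Mutual information for every pair of the given 1-based positions."""
    columns = {p: [row[p - 1] for row in rows] for p in positions}
    return {(p, q): mutual_information(columns[p], columns[q])
            for p in positions for q in positions if p < q}


# Toy alignment, invented for illustration only: columns 1 and 2 always
# change together, column 3 varies independently of both.
rows = ["AGG", "AGT", "CTG", "CTT"]
print(mi_matrix(rows, [1, 2, 3]))
```

The resulting dictionary holds the upper triangle of the matrix that a heatmap would display.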

 


Figure 8: Heatmap of pairwise mutual information of selected columns in alignment.

 

We see several possible pairs of correlated mutations. Positions 161, 790, 842 and 946 are highly correlated. Furthermore, positions 22 and 79 show a significant correlation with each other as well as with position 161. Altogether, our top mutations for overall virulence are columns 842 and 946, columns 22 and 79, and columns 790 and 161.

 

Efforts: One team member applied the tools offered by the WEKA library to identify the best features for classifying the strains with respect to their symptoms (1 day). Another team member implemented a mutual information analysis in R to identify pairwise correlated columns in the alignment (1 day). One team member implemented the alignment viewer using the Qt toolkit in about one week. The tree visualizer was already implemented in Java and only needed to be extended to communicate with the alignment viewer (1 day).